Abstract

We describe a statistical model over linguis- 
tic areas and phylogeny. Our model recov- 
ers known areas and identifies a plausible hi- 
erarchy of areal features. The use of areas 
improves genetic reconstruction of languages 
both qualitatively and quantitatively according 
to a variety of metrics. We model linguistic 
areas by a Pitman-Yor process and linguistic
phylogeny by Kingman's coalescent. 

1 Introduction 

Why are some languages more alike than others? 
This question is one of the most central issues in historical
linguistics. Typically, one of three answers
is given (Aikhenvald and Dixon, 2001; Campbell, 
2006). First, the languages may be related "genet- 
ically." That is, they may have all derived from a 
common ancestor language. Second, the similarities 
may be due to chance. Some language properties 
are simply more common than others, which is often
attributed mostly to linguistic universals
(Greenberg, 1963). Third, the languages may
be related areally. Languages that occupy the same 
geographic area often exhibit similar characteristics, 
not due to genetic relatedness, but due to sharing. 
Regions (and the languages contained within them) 
that exhibit sharing are called linguistic areas and 
the features that are shared are called areal features. 

Much is not understood or agreed upon in the field 
of areal linguistics. Different linguists favor different
definitions of what it means to be a linguistic area
(are two languages sufficient to describe an area or 
do you need three (Thomason, 2001; Katz, 1975)?), 



what areal features are (is there a linear ordering of
"borrowability" (Katz, 1975; Curnow, 2001) or is
that too prescriptive?), and what causes sharing to 
take place (does social status or number of speakers 
play a role (Thomason, 2001)?). 

In this paper, we attempt to provide a statistical 
answer to some of these questions. In particular, 
we develop a Bayesian model of typology that al- 
lows for, but does not force, the existence of linguis- 
tic areas. Our model also allows for, but does not 
force, preference for some feature to be shared are- 
ally. When applied to a large typological database 
of linguistic features (Haspelmath et al., 2005), we 
find that it discovers linguistic areas that are well 
documented in the literature (see Campbell (2005) 
for an overview), and a small preference for cer- 
tain features to be shared areally. This latter agrees, 
to a lesser degree, with some of the published hi- 
erarchies of borrowability (Curnow, 2001). Finally, 
we show that reconstructing language family trees is 
significantly aided by knowledge of areal features. 
We note that Warnow et al. (2005) have independently
proposed a model for phonological change in
Indo-European (based on the Dyen dataset (Dyen et
al., 1992)) that includes notions of borrowing. Our
model differs in that we (a) base it on typological
features rather than just lexical patterns and (b)
explicitly represent language areas, not just one-time
borrowing phenomena.

2 Background 

We describe (in Section 3) a non-parametric, hier- 
archical Bayesian model for finding linguistic areas 
and areal features. In this section, we provide necessary
background, both linguistic and statistical, for
understanding our model.

2.1 Areal Linguistics

Areal effects on linguistic typology have been studied
since at least the late 1920s by Trubetzkoy,
though the idea of tracing family trees for languages
goes back to the mid 1800s and the comparative
study of historical linguistics dates back, perhaps, to
Giraldus Cambrensis in 1194 (Campbell, In press).
A recent article provides a short introduction to both 
the issues that surround areal linguistics, as well as 
an enumeration of many of the known language ar- 
eas (Campbell, 2005). A fairly wide, modern treat- 
ment of the issues surrounding areal diffusion is also 
given by essays in a recent book edited by Aikhen- 
vald and Dixon (2001). The essays in this book pro- 
vide a good introduction to the issues in the field. 
Campbell (2006) provides a critical survey of these 
and other hypotheses relating to areal linguistics. 

There are several issues which are basic to the 
study of areal linguistics (these are copied almost 
directly from Campbell (2006)). Must a linguistic 
area comprise more than two languages? Must it 
comprise more than one language family? Is a sin- 
gle trait sufficient to define an area? How "nearby" 
must languages in an area be to one another? Are
some features more easily borrowed than others?

Despite these formal definitional issues of what 
constitutes a language area and areal features, most 
historical linguists seem to believe that areal effects 
play some role in the change of languages. 

2.1.1 Established Linguistic Areas 

Below, we list some of the well-known linguistic 
areas; Campbell (2005) provides a more complete
listing together with example areal features for these 
areas. For each area, we list associated languages: 
The Balkans: Albanian, Bulgarian, Greek, Macedonian,
Rumanian and Serbo-Croatian. (Sometimes: Romani and Turkish.)

South Asian: Languages belonging to the Dravidian,
Indo-Aryan, Munda, Tibeto-Burman families.

Meso-America: Cuitlatec, Huave, Mayan, Mixe-Zoquean,
Nahua, Otomanguean, Tarascan, Tequistlatecan,
Totonacan and Xincan.

North-west America: Alsea, Chimakuan, Coosan,
Eyak, Haida, Kalapuyan, Lower Chinook, Salishan,
Takelman, Tlingit, Tsimshian and Wakashan.

The Baltic: Baltic languages, Baltic German, and
Finnic languages (especially Estonian and Livonian).
(Sometimes many more are included, such as:
Belorussian, Latvian, Lithuanian, Norwegian, Old
Prussian, Polish, Romani, Russian, Ukrainian.)

Ethiopia: Afar, Amharic, Anyuak, Awngi, Beja,
Ge'ez, Gumuz, Janjero, Kefa, Sidamo, Somali, Tigre,
Tigrinya and Wellamo.

Needless to say, the exact definition and extent of
the actual areas are subject to significant debate. Moreover,
claims have been made in favor of many linguistic
areas not defined above. For instance, Dixon
(2001) presents arguments for several Australian lin- 
guistic areas and Matisoff (2001) defines a South- 
East Asian language area. Finally, although "folk 
lore" is in favor of identifying a linguistic area in- 
cluding English, French and certain Norse languages 
(Norwegian, Swedish, Low Dutch, High German, 
etc.), there are counter-arguments to this position 
(Thomason, 2001) (see especially Case Study 9.8). 

2.1.2 Linguistic Features 

Identifying which linguistic features are most easily
shared "areally" is a long-standing problem in
contact linguistics. Here we briefly review some of
the major claims. Much of this overview is adopted
from the summary given by Curnow (2001).

Haugen (1950) considers only borrowability as 
far as the lexicon is concerned. He provided evi- 
dence that nouns are the easiest, followed by verbs, 
adjectives, adverbs, prepositions, etc. Ross (1988) 
corroborates Haugen's analysis and deepens it to 
cover morphology, syntax and phonology. He pro- 
poses the following hierarchy of borrowability (eas- 
iest items coming first): nouns > verbs > adjectives 
> syntax > non-bound function words > bound 
morphemes > phonemes. Coming from a "con- 
straints" perspective, Moravcsik (1978) suggests 
that: lexical items must be borrowed before lexi- 
cal properties; inflected words before bound mor- 
phemes; verbal items can never be borrowed; etc. 

Curnow (2001) argues that the problem with coming up
with a reasonable hierarchy of borrowability is that "we may
never be able to develop such constraints." Nevertheless,
he divides the space of borrowable features
into 15 categories and discusses the evidence supporting
each of these categories, including: phonetics
(rare), phonology (common), lexical (very common),
interjections and discourse markers (common),
free grammatical forms (occasional), bound
grammatical forms (rare), position of morphology
(rare), syntactic frames (rare), clause-internal syntax
(common), between-clause syntax (occasional).

2.2 Non-parametric Bayesian Models 

We treat the problem of understanding areal linguistics
as a statistical question, based on a database of
typological information. Due to the issues raised in 
the previous section, we do not want to commit to 
the existence of a particular number of linguistic ar- 
eas, or particular sizes thereof. (Indeed, we do not 
even want to commit to the existence of any linguis- 
tic areas.) However, we will need to "unify" the 
languages that fall into a linguistic area (if such a 
thing exists) by means of some statistical param- 
eter. Such problems have been studied under the 
name non-parametric models. The idea behind non-parametric
models is that one does not commit a priori
to a particular number of parameters. Instead,
we allow the data to dictate how many parameters 
there are. In Bayesian modeling, non-parametric 
distributions are typically used as priors; see Jor- 
dan (2005) or Ghahramani (2005) for overviews. In 
our model, we use two different non-parametric priors:
the Pitman-Yor process (for modeling linguistic
areas) and Kingman's coalescent (for modeling lin- 
guistic phylogeny), both described below. 

2.2.1 The Pitman-Yor Process

One particular example of a non-parametric prior 
is the Pitman-Yor process (Pitman and Yor, 1997),
which can be seen as an extension to the better-known
Dirichlet process (Ferguson, 1974). The
Pitman-Yor process can be understood as a particular
example of a Chinese Restaurant process (CRP)
(Pitman, 2002). The idea in all CRPs is that there 
exists a restaurant with an infinite number of ta- 
bles. Customers come into the restaurant and have 
to choose a table at which to sit. 

The Pitman-Yor process is described by three parameters:
a base rate $\alpha$, a discount parameter $d$ and
a mean distribution $G_0$. These combine to describe
a process denoted by $\mathcal{PY}(\alpha, d, G_0)$. The parameters
$\alpha$ and $d$ must satisfy $0 \le d < 1$ and $\alpha > -d$. In
the CRP analogy, the model works as follows. The
first customer comes in and sits at any table. After
$N$ customers have come in and seated themselves
(at a total of $K$ tables), the $(N+1)$th customer arrives. In
the Pitman-Yor process, this customer sits at a
new table with probability proportional to $\alpha + Kd$
and sits at a previously occupied table $k$ with probability
proportional to $\#_k - d$, where $\#_k$ is the number
of customers already seated at table $k$. Finally,
with each table $k$ we associate a parameter $\theta_k$, with
each $\theta_k$ drawn independently from $G_0$. An important
property of the Pitman-Yor process is that draws
from it are exchangeable: perhaps counterintuitively,
the distribution does not care about customer order.

The Pitman-Yor process induces a power-law distribution
on the number of singleton tables (i.e., the
number of tables that have only one customer). This
can be seen by noticing two things. In general,
the number of singleton tables grows as $O(\alpha N^d)$.
When $d = 0$, we obtain a Dirichlet process with the
number of singleton tables growing as $O(\alpha \log N)$.
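
The seating rule above can be made concrete with a short simulation. The following is a minimal sketch, not the implementation used in this paper: the function name `pitman_yor_crp` and its arguments are illustrative assumptions, and the table parameters drawn from the mean distribution are omitted, so only table sizes are tracked.

```python
import random

def pitman_yor_crp(num_customers, alpha, d, rng=None):
    """Seat customers one at a time; return the list of table sizes."""
    if not (0.0 <= d < 1.0 and alpha > -d):
        raise ValueError("need 0 <= d < 1 and alpha > -d")
    rng = rng or random.Random()
    tables = []  # tables[k] = number of customers seated at table k
    for _ in range(num_customers):
        if not tables:
            tables.append(1)  # the first customer always opens a table
            continue
        # Unnormalised weights: (size_k - d) for each occupied table,
        # and alpha + K * d for opening a new table.
        weights = [size - d for size in tables] + [alpha + len(tables) * d]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
    return tables

if __name__ == "__main__":
    sizes = pitman_yor_crp(1000, alpha=1.0, d=0.5, rng=random.Random(0))
    print(len(sizes), "tables,", sum(s == 1 for s in sizes), "singletons")
```

Running the sketch with the discount set to zero and then to a positive value gives an informal feel for the difference between logarithmic and power-law growth in the number of singleton tables.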

2.2.2 Kingman's Coalescent 

Kingman's coalescent is a standard model in pop- 
ulation genetics describing the common genealogy 
(ancestral tree) of a set of individuals (Kingman, 
1982b; Kingman, 1982a). In its full form it is a dis- 
tribution over the genealogy of a countable set. 

Consider the genealogy of n individuals alive at 
the present time $t = 0$. We can trace their ancestry
backwards in time to the distant past $t = -\infty$.
Assume each individual has one parent (in genet- 
ics, haploid organisms), and therefore genealogies 
of [n] = {1, . . . , n} form a directed forest. King- 
man's n-coalescent is simply a distribution over ge- 
nealogies of n individuals. To describe the Markov 
process in its entirety, it is sufficient to describe 
the jump process (i.e. the embedded, discrete-time, 
Markov chain over partitions) and the distribution 
over coalescent times. In the n-coalescent, every 
pair of lineages merges independently with rate 1, 
with parents chosen uniformly at random from the 
set of possible parents at the previous time step. 

The n-coalescent has some interesting statistical
properties (Kingman, 1982b; Kingman, 1982a). First, the
marginal distribution over tree topologies is uniform
and independent of the coalescent times. Second,
it is infinitely exchangeable: given a genealogy
drawn from an n-coalescent, the genealogy of
any m contemporary individuals alive at time $t \le 0$
embedded within the genealogy is a draw from the
m-coalescent. Thus, taking $n \to \infty$, there is a distribution
over genealogies of a countably infinite population
for which the marginal distribution of the genealogy
of any n individuals gives the n-coalescent.
Kingman called this the coalescent.
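
As an illustration of the jump process and coalescent times described above, the sketch below draws one genealogy from an n-coalescent. It is only a prior simulation under stated assumptions (the name `sample_n_coalescent` and the event format are hypothetical) and has nothing to do with the inference algorithms of Teh et al. (2007).

```python
import random

def sample_n_coalescent(n, rng=None):
    """Return merge events (time, left, right, parent), going back from t = 0."""
    rng = rng or random.Random()
    lineages = list(range(n))  # leaves are numbered 0 .. n-1
    next_id = n
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # Each of the k*(k-1)/2 pairs merges at rate 1, so the waiting
        # time until the next merge is exponential with that total rate.
        t -= rng.expovariate(k * (k - 1) / 2)
        left, right = rng.sample(lineages, 2)  # uniformly chosen pair
        lineages.remove(left)
        lineages.remove(right)
        lineages.append(next_id)
        events.append((t, left, right, next_id))
        next_id += 1
    return events

if __name__ == "__main__":
    for event in sample_n_coalescent(5, rng=random.Random(1)):
        print(event)
```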

Teh et al. (2007) recently described efficient inference
algorithms for Kingman's coalescent. They
applied the coalescent to the problem of recovering
linguistic phylogenies. The application was largely
successful, at least in comparison to alternative
algorithms that use the same data. Unfortunately,
even in the results they present, one can see significant
areal effects. For instance, in their Figure 3(a),
Romanian is very near Albanian and Bulgarian. This
is likely an areal effect: specifically, an effect due to
the Balkan language area. We will revisit this issue
in our own experiments.

3 A Bayesian Model for Areal Linguistics 

We will consider a data set consisting of $N$ languages
and $F$ typological features. We denote the
value of feature $f$ in language $n$ as $X_{n,f}$. For simplicity
of exposition, we will assume two things: (1)
there is no unobserved data and (2) all features are 
binary. In practice, for the data we use (described in 
Section 4), neither of these is true. However, both 
extensions are straightforward. 

When we construct our model, we attempt to be
as neutral as possible on the "areal linguistics" questions
defined in Section 2.1. We allow areas with only
two languages (though for brevity we do not present
them in the results). We allow areas with only one
family (though, again, do not present them). We are
generous with our notion of locality, allowing a radius
of 1000 kilometers (though see Section 5.4 for
an analysis of the effect of radius).¹ And we allow,
but do not enforce, trait weights. All of this is accomplished
through the construction of the model
and the choice of the model hyperparameters.

At a high level, our model works as follows. Values
$X_{n,f}$ appear for one of two reasons: they are either
areally derived or genetically derived. A latent
variable $Z_{n,f}$ determines this. If it is derived areally,
then the value $X_{n,f}$ is drawn from a latent variable
corresponding to the value preferences in the language
area to which language $n$ belongs. If it is derived
genetically, then $X_{n,f}$ is drawn from a variable
corresponding to value preferences for the genetic
substrate to which language $n$ belongs. The set of
areas, and the area to which a language belongs, are
given by yet more latent variables. It is this aspect of
the model for which we use the Pitman-Yor process:
languages are customers, areas are tables and area
value preferences are the parameters of the tables.

¹ A reader might worry about exchangeability: our method
of making language centers and locations part of the Pitman-Yor
distribution ensures this is not an issue. An alternative would
be to use a location-sensitive process such as the kernel
stick-breaking process (Dunson and Park, 2007), though we do not
explore that here.

3.1 The formal model 

We assume that the value a feature takes for a particular
language (i.e., the value of $X_{n,f}$) can be explained
either genetically or areally.² We denote this
by a binary indicator variable $Z_{n,f}$, where a value 1
means "areal" and a value 0 means "genetic." We assume
that each $Z_{n,f}$ is drawn from a binomial with a
feature-specific parameter $\pi_f$. By having the parameter
feature-specific, we express the fact that some features
may be more or less likely to be shared than
others. In other words, a high value of $\pi_f$ would
mean that feature $f$ is easily shared areally, while a
low value would mean that feature $f$ is hard to share.
Each language $n$ has a known latitude/longitude.

We further assume that there are K linguistic ar- 
eas, where K is treated non-parametrically by means 
of the Pitman-Yor process. Note that in our context,
a linguistic area may contain only one language, 
which would technically not be allowed according 
to the linguistic definition. When a language belongs 
to a singleton area, we interpret this to mean that it 
does not belong to any language area. 

Each language area $k$ (including the singleton areas)
has a set of $F$ associated parameters $\phi_{k,f}$, where
$\phi_{k,f}$ is the probability that feature $f$ is "on" in area $k$.
It also has a "central location" given by a longitude
and latitude denoted $c_k$. We only allow languages
to belong to areas that fall within a given radius $R$
of them (distances computed according to geodesic
distance). This accounts for the "geographical" constraints
on language areas. We denote the area to
which language $n$ belongs as $a_n$.
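
The radius constraint can be illustrated with a small distance check. This is a sketch under assumptions: it uses the haversine great-circle formula on a spherical Earth as a stand-in for the geodesic distance mentioned above, and the names `great_circle_km`, `may_belong` and the Earth radius constant are not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))

def may_belong(language_loc, area_center, radius_km=1000.0):
    """True if a language at (lat, lon) lies within the radius of an area centre."""
    return great_circle_km(*language_loc, *area_center) <= radius_km

if __name__ == "__main__":
    # Approximate coordinates of two Balkan capitals, used only as an example.
    print(may_belong((42.7, 23.3), (41.3, 19.8)))
```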

We assume that each language belongs to a "family
tree." We denote the parent of language $n$ in the
family tree by $p_n$. We associate with each node $i$ in
the family tree and each feature $f$ a parameter $\theta_{i,f}$.
As in the areal case, $\theta_{i,f}$ is the probability that feature
$f$ is on for languages that descend from node $i$
in the family tree. We model genetic trees by Kingman's
coalescent with binomial mutation.

² As mentioned in the introduction, (at least) one more option
is possible: chance. We treat "chance" as noise and model it in
the data generation process, not as an alternative "source."

Figure 1: Full hierarchical Areal model; see Section 3.1 for a complete description.

Finally, we put non-informative priors on all the
hyperparameters. Written hierarchically, our model
is shown in Figure 1. There, by
$(p, \theta) \sim \text{Coalescent}(\pi_0, m_0)$, we mean that the tree
and parameters are given by a coalescent.
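
To make the last generative step concrete, the sketch below samples the indicators and feature values once everything else is fixed. It follows only the prose description in this section: the Pitman-Yor prior over areas, the radius constraint, the coalescent and the hyperpriors are all left out, and the function name and argument layout are assumptions made for illustration.

```python
import random

def generate_features(area_of, parent_of, pi, phi, theta, rng=None):
    """Sample Z[n][f] and X[n][f] given fixed areas, parents and parameters.

    area_of[n]   : index of the area of language n
    parent_of[n] : index of the tree node that is the parent of language n
    pi[f]        : probability that feature f is areally derived
    phi[k][f]    : probability that feature f is on in area k
    theta[i][f]  : probability that feature f is on below tree node i
    """
    rng = rng or random.Random()
    Z, X = [], []
    for n in range(len(area_of)):
        z_row, x_row = [], []
        for f in range(len(pi)):
            z = 1 if rng.random() < pi[f] else 0  # 1 = areal, 0 = genetic
            p_on = phi[area_of[n]][f] if z == 1 else theta[parent_of[n]][f]
            z_row.append(z)
            x_row.append(1 if rng.random() < p_on else 0)
        Z.append(z_row)
        X.append(x_row)
    return Z, X

if __name__ == "__main__":
    Z, X = generate_features([0, 0, 1], [0, 1, 1], [0.3, 0.7],
                             [[0.9, 0.1], [0.2, 0.8]],
                             [[0.5, 0.5], [0.1, 0.9]], rng=random.Random(0))
    print(Z, X)
```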

3.2 Inference 

Inference in our model is mostly by Gibbs sampling.
Most of the distributions used are conjugate,
so Gibbs sampling can be implemented efficiently.
The only exceptions are: (1) the coalescent,
for which we use the GreedyRate1 algorithm
described by Teh et al. (2007); (2) the area centers $c$,
for which we use a Metropolis-Hastings step. Our
proposal distribution is a Gaussian centered at the
previous center, with standard deviation of 5.
Experimentally, this resulted in an acceptance rate of
about 50%.
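
The Metropolis-Hastings step for an area center can be sketched as a random-walk update. Several things here are assumptions rather than details from the paper: `log_target` is a hypothetical callable returning the log of the unnormalised conditional density of a center, the step size is taken to be in the same units as the coordinates, and longitude wrap-around is ignored.

```python
import math
import random

def mh_update_center(center, log_target, step_sd=5.0, rng=None):
    """One random-walk Metropolis-Hastings update of a (lat, lon) area centre."""
    rng = rng or random.Random()
    proposal = (rng.gauss(center[0], step_sd), rng.gauss(center[1], step_sd))
    # The Gaussian proposal is symmetric, so the acceptance ratio reduces
    # to the ratio of target densities at the proposal and current point.
    log_ratio = log_target(proposal) - log_target(center)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return center, False

if __name__ == "__main__":
    # Toy target: an isotropic Gaussian bump around (45, 20).
    def toy(c):
        return -((c[0] - 45.0) ** 2 + (c[1] - 20.0) ** 2) / 50.0

    rng = random.Random(0)
    center, accepted = (40.0, 25.0), 0
    for _ in range(1000):
        center, ok = mh_update_center(center, toy, rng=rng)
        accepted += ok
    print(center, accepted / 1000)
```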

In our implementation, we analytically integrate
out $\pi$ and $\phi$ and sample only over $Z$, the coalescent
tree, and the area assignments. In some of our experiments,
we treat the family tree as given. In this
case, we also analytically integrate out the $\theta$ parameters
and sample only over $Z$ and area assignments.

4 Typological Data 

The database on which we perform our analysis is
the World Atlas of Language Structures (henceforth,
WALS) (Haspelmath et al., 2005). The database
contains information about 2150 languages (sampled
from across the world). There are 139 typological
features in this database. The database is sparse:
only 16% of the possible language/feature pairs are
known. We use the version extracted and preprocessed
by Daume III and Campbell (2007).

In WALS, languages are grouped into 38 language
families (including Indo-European, Afro-Asiatic,
Austronesian, Niger-Congo, etc.). Each of these language
families is divided into a number of language
genera. The Indo-European family includes ten genera,
including: Germanic, Romance, Indic and Slavic.
The Austronesian family includes seventeen genera,
including: Borneo, Oceanic, Palauan and Sundic.
Overall, there are 275 genera represented in WALS.

We further preprocess the data as follows. For
the Indo-European subset (henceforth, "IE"), we remove
all languages with < 10 known features and
then remove all features that appear in at most 1/4
of the languages. This leads to 73 languages and
87 features. For the whole-world subset, we remove
languages with < 25 known features and then features
that appear in at most 1/10 of the languages.
This leads to 349 languages and 129 features.
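
The two-stage filter just described can be sketched as follows. The data layout (a dictionary from language to its known feature values) and the name `filter_sparse` are assumptions; so is the choice to compute the feature threshold relative to the languages that survive the first stage, which the text does not spell out.

```python
def filter_sparse(data, min_known, min_fraction):
    """Drop sparse languages, then drop rarely observed features.

    data maps language -> {feature: value}, holding only the known values.
    """
    # Stage 1: keep languages with at least `min_known` known features.
    kept = {lang: feats for lang, feats in data.items()
            if len(feats) >= min_known}
    # Stage 2: count in how many remaining languages each feature is known.
    counts = {}
    for feats in kept.values():
        for f in feats:
            counts[f] = counts.get(f, 0) + 1
    # Remove features that appear in at most `min_fraction` of the languages.
    good = {f for f, c in counts.items() if c > min_fraction * len(kept)}
    return {lang: {f: v for f, v in feats.items() if f in good}
            for lang, feats in kept.items()}

if __name__ == "__main__":
    toy = {"A": {"f1": 1, "f2": 0}, "B": {"f1": 0, "f2": 1, "f3": 1}, "C": {"f1": 1}}
    print(filter_sparse(toy, min_known=2, min_fraction=0.5))
```

Under these assumptions, the IE setting in the text would correspond to `min_known=10` and `min_fraction=0.25`, and the whole-world setting to `min_known=25` and `min_fraction=0.1`.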

5 Experiments

5.1 Identifying Language Areas

Our first experiment is aimed at discovering lan- 
guage areas. We first focus on the IE family, and 
then extend the analysis to all languages. In both 
cases, we use a known family tree (for the IE ex- 
periment, we use a tree given by the language genus 
structure; for the whole-world experiment, we use a 
tree given by the language family structure). We run 
each experiment with five random restarts and 2000 
iterations. We select the MAP configuration from 
the combination of these runs. 

In the IE experiment, the model identified the
areas shown in Figure 2. The best area identified
by our model is the second one listed, which
clearly correlates highly with the Balkans. There
are two areas identified by our model (the first and
last) that include only Indic and Iranian languages.
While we are not aware of previous studies of these
as linguistic areas, they are not implausible given
the history of the region. The fourth area identified
by our model corresponds roughly to the debated
"English" area. Our area includes the requisite
French/English/German/Norwegian group, as
well as the somewhat surprising Irish. However, in
addition to being intuitively plausible, it is not hard
to find evidence in the literature for the contact
relationship between English and Irish (Sommerfelt,
1960).

(Indic) Bhojpuri, Darai, Gujarati, Hindi, Kalami, Kashmiri,
Kumauni, Nepali, Panjabi, Shekhawati, Sindhi (Iranian) Ormuri,
Pashto

(Albanian) Albanian (Greek) Greek (Modern) (Indic) Romani
(Kalderash) (Romance) Romanian, Romansch (Scharans),
Romansch (Sursilvan), Sardinian (Slavic) Bulgarian, Macedonian,
Serbian-Croatian, Slovak, Slovene, Serbian

(Baltic) Latvian, Lithuanian (Germanic) Danish, Swedish
(Slavic) Polish, Russian

(Celtic) Irish (Germanic) English, German, Norwegian
(Romance) French

(Indic) Prasuni, Urdu (Iranian) Persian, Tajik

Plus 46 non-areal languages

Figure 2: IE areas identified. Areas that consist of just
one genus are not listed, nor are areas with two languages.

(Mayan) Huastec, Jakaltek, Mam, Tzutujil (Mixe-Zoque)
Zoque (Copainala) (Oto-Manguean) Mixtec (Chalcatongo),
Otomi (Mezquital) (Uto-Aztecan) Nahuatl (Tetelcingo), Pipil

(Baltic) Latvian, Lithuanian (Finnic) Estonian, Finnish
(Slavic) Polish, Russian, Ukrainian

(Austro-Asiatic) Khasi (Dravidian) Telugu (IE) Bengali
(Sino-Tibetan) Bawm, Garo, Newari (Kathmandu)

Figure 3: A small subset of the world areas identified.

In the whole-world experiment, the model identified
too many linguistic areas to fit (39 in total that
contained at least two languages, and contained at
least two language families). In Figure 3, we depict
the areas found by our model that best correspond
to the areas described in Section 2.1.1. We
acknowledge that this gives a warped sense of the
quality of our model. Nevertheless, our model is
able to identify large parts of the Meso-American
area, the Baltic area and the South Asian area. (It
also finds the Balkans, but since these languages
are all IE, we do not consider it a linguistic area in
this evaluation.) While our model does find areas
that match Meso-American and North-west American
areas, neither is represented in its entirety (according
to the definition of these areas given in Section 2.1.1).

Table 1: Area identification scores for two baseline algorithms
(K-means and Pitman-Yor clustering) that do not
use hierarchical structure, and for the Areal model we
have presented. Higher is better and all differences are
statistically significant at the 95% level.

Despite the difficulty humans have in assigning
linguistic areas, in Table 1 we explicitly compare
the quality of the areal clusters found on the IE subset.
We compare against the most inclusive areal
lists from Section 2.1.1 for IE: the Balkans and the
Baltic. When there is overlap (e.g., Romani appears
in both lists), we assigned it to the Balkans.

We compare our model with a flat Pitman-Yor
model that does not use the hierarchy. We also
compare to a baseline K-means algorithm. For K-means,
we ran with $K \in \{5, 10, 15, \ldots, 80, 85\}$
and chose the value of $K$ for each metric that did
best (giving an unfair advantage). Clustering performance
is measured on the Indo-European task
according to the Rand Index, F-score, Normalized
Edit Score (Pantel, 2003) and Normalized Variation
of Information (Meila, 2003). In these results, we
see that the Pitman-Yor process model dominates the
K-means model and the Areal model dominates the
Pitman-Yor model.
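
Of the four metrics, the Rand Index is the simplest to state, and a minimal sketch of it is given below. The function name is illustrative, and the sketch says nothing about how non-areal languages were labelled in the gold standard, which the text does not specify.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same two or more items")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

if __name__ == "__main__":
    gold = ["Balkans", "Balkans", "Baltic", "Baltic"]
    found = [0, 0, 1, 2]
    print(rand_index(gold, found))
```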

5.2 Identifying Areal Features 

Our second experiment is an analysis of the features 
that tend to be shared areally (as opposed to genet- 
ically). For this experiment, we make use of the 
whole-world version of the data, again with known 
language family structure. We initialize a Gibbs 
sampler from the MAP configuration found in Sec- 
tion 5.1. We run the sampler for 1000 iterations and 
take samples every ten steps. 

From one particular sample, we can estimate a
posterior distribution over each $\pi_f$. Due to conjugacy,
we obtain a posterior distribution of
$\pi_f \sim \text{Beta}(1 + \sum_n Z_{n,f}, 1 + \sum_n [1 - Z_{n,f}])$. The 1s come
from the prior. From this Beta distribution, we can
ask the question: what is the probability that a value
of $\pi_f$ drawn from this distribution will have value
< 0.5? If this value is high, then the feature is likely
to be a "genetic feature"; if it is low, then the feature
is likely to be an "areal feature." We average these
probabilities across all 100 samples.

Table 2: Average probability of genetic for each feature
category and the number of features in that category.
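
The tail probability of the Beta posterior described in this section can be computed exactly when the indicators are 0/1, because both Beta parameters are then integers. The sketch below relies on the standard identity between the Beta distribution function and a binomial tail; the function name `prob_genetic` is an assumption, and averaging over Gibbs samples is left to the caller.

```python
from math import comb

def prob_genetic(z_column):
    """P(pi_f < 0.5) under Beta(1 + sum Z, 1 + sum (1 - Z)).

    z_column holds the 0/1 indicators Z[n][f] of one feature f in one sample.
    """
    a = 1 + sum(z_column)
    b = 1 + len(z_column) - sum(z_column)
    m = a + b - 1
    # For integer a and b, the Beta(a, b) distribution function at x equals
    # P(Binomial(m, x) >= a); at x = 0.5 every outcome has weight 2**-m.
    return sum(comb(m, j) for j in range(a, m + 1)) / 2 ** m

if __name__ == "__main__":
    print(prob_genetic([0, 0, 0, 1]))  # mostly genetic indicators
    print(prob_genetic([1, 1, 1, 0]))  # mostly areal indicators
```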

The features that are most likely to be areal according
to our model are summarized in Table 2. In
this table, we list the categories to which each feature
belongs, together with the number of features in
that category, and the average probability that a feature
in that category is genetically transmitted. Apparently,
the vast majority of features are not areal.

We can treat the results presented in Table 2 as a 
hierarchy of borrowability. In doing so, we see that 
our hierarchy agrees to a large degree with the hier- 
archies summarized in Section 2.1.2. Indeed, (aside 
from "Tea", which we will ignore) the two most 
easily shared categories according to our model are 
phonology and the lexicon; this agrees with the
prevailing view in linguistics.

Lower in our list, we see that noun-related categories
tend to precede their verb-related counterparts
(nominal categories before verbal categories,
nominal syntax before complex sentences). According
to Curnow (2001), the most difficult features to
borrow are phonetics (for which we have no data),
bound grammatical forms (which appear low on our
list), morphology (which is 99% genetic, according
to our model) and syntactic frames (which would
roughly correspond to "complex sentences", another
item which is 99% genetic in our model).

Indo-European
Model         Accuracy          Log Prob
Baseline      0.635 (±0.007)    -0.583 (±0.008)
Areal model   0.689 (±0.010)    -0.526 (±0.027)

World
Model         Accuracy          Log Prob
Baseline      0.628 (±0.001)    -0.654 (±0.003)
Areal model   0.635 (±0.002)    -0.565 (±0.011)

Table 3: Prediction accuracies and log probabilities for
IE (top) and the world (bottom).

5.3 Genetic Reconstruction

In this section, we investigate whether the use of
areal knowledge can improve the automatic reconstruction
of language family trees. We use Kingman's
coalescent (see Section 2.2.2) as a probabilistic
model of trees, endowed with a binomial mutation
process on the language features.

Our baseline model is to run the vanilla coalescent
on the WALS data, effectively reproducing the results
presented by Teh et al. (2007). This method was already
shown to outperform competing hierarchical
clustering algorithms such as average-link agglomerative
clustering (see, e.g., Duda and Hart (1973))
and the Bayesian Hierarchical Clustering algorithm
(Heller and Ghahramani, 2005).

We run the same experiment both on the IE subset
of data and on the whole-world subset. We evaluate
the results qualitatively, by observing the trees
found (on the IE subset), and quantitatively (below).
For the qualitative analysis, we show the subset of
IE that does not contain Indic languages or Iranian
languages (just to keep the figures small). The tree
derived from the original data is on the left in Figure 4;
the tree based on areal information is on the right.
As we can see, the use of areal information
qualitatively improves the structure of the
tree. Where the original tree had a number of errors
with respect to Romance and Germanic languages,
these are sorted out in the areally-aware tree. Moreover,
Greek now appears in a more appropriate part
of the tree and English appears on a branch that is
further out from the Norse languages.

Figure 4: Genetic trees of IE languages. (Left) with no areal knowledge; (Right) with areal model.

We perform two varieties of quantitative analysis.
In the first, we attempt to predict unknown feature
values. In particular, we hide an additional 10% of
the feature values in the WALS data and fit a model
to the remaining 90%. We then use that model to
predict the hidden 10%. The baseline model is to
make predictions according to the family tree. The
augmented model is to make predictions according
to the family tree for those features identified as genetic
and according to the linguistic area for those
features identified as areal. For both settings, we
compute both the absolute accuracy as well as the
log probability of the hidden data under the model
(the latter is less noisy). We repeat this experiment
10 times with a different random 10% hidden. The
results are shown in Table 3. The differences
are not large, but are outside one standard deviation.

Indo-European versus Genus
Model         Purity    Subtree   LOO Acc
Baseline      0.6078    0.5065    0.3218
Areal model   0.6494    0.5455    0.2528

World versus Genus
Model         Purity    Subtree   LOO Acc
Baseline      0.3599    0.2253    0.7747
Areal model   0.4001    0.2450    0.7982

World versus Family
Model         Purity    Subtree   LOO Acc
Baseline      0.4163    0.3280    0.4842
Areal model   0.5143    0.3318    0.5198

Table 4: Scores for IE as compared against genus (top);
for world against genus (mid) and against family (low).

For the second quantitative analysis, we
present purity scores (Heller and Ghahramani,
2005), subtree scores (the number of interior nodes
with pure leaf labels, normalized) and leave-one-out
log accuracies (all scores are between 0 and 1, and
higher scores are better). These scores are computed
against both language family and language genus as
the "classes." The results are in Table 4. As
we can see, the results are generally in favor of the
Areal model (LOO Acc on IE versus Genus
notwithstanding), depending on the evaluation metric.
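
The subtree score lends itself to a short sketch. The tree encoding (nested tuples of leaf names) and the function name are assumptions, as is the normalization: the text says only "normalized", and this sketch divides by the total number of interior nodes.

```python
def subtree_score(tree, label_of):
    """Fraction of interior nodes whose leaves all carry the same class label.

    tree is a nested tuple of leaf names, e.g. (("a", "b"), "c");
    label_of maps each leaf name to its class (family or genus).
    """
    pure = total = 0

    def visit(node):
        nonlocal pure, total
        if not isinstance(node, tuple):
            return {label_of[node]}  # a leaf contributes its own label
        labels = set()
        for child in node:
            labels |= visit(child)
        total += 1
        pure += len(labels) == 1
        return labels

    visit(tree)
    return pure / total if total else 0.0

if __name__ == "__main__":
    genus = {"Swedish": "Germanic", "Danish": "Germanic", "English": "Germanic",
             "French": "Romance", "Italian": "Romance"}
    tree = ((("Swedish", "Danish"), "English"), ("French", "Italian"))
    print(subtree_score(tree, genus))
```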



Radius   Purity    Subtree   LOO Acc
125      0.6237    0.4855    0.2013
250      0.6457    0.5325    0.2299
500      0.6483    0.5455    0.2413
1000     0.6494    0.5455    0.2528
2000     0.6464    0.4935    0.3218
4000     0.6342    0.4156    0.4138

Table 5: Scores for IE vs genus at varying radii.

5.4 Effect of Radius

Finally, we evaluate the effect of the radius hyper- 
parameter on performance. Table 5 shows perfor- 
mance for models built with varying radii. As can 
be seen by purity and subtree scores, there is a 
"sweet spot" around 500 to 1000 kilometers where 
the model seems optimal. LOO (strangely) seems 
to continue to improve as we allow areas to grow 
arbitrarily large. This is perhaps overfitting. Nevertheless,
performance is robust for a range of radii.

6 Discussion

We presented a model that is able to recover well-known
linguistic areas. Using these areas, we have
shown improvement in the ability to recover phylogenetic
trees of languages. It is important to note
that despite our successes, there is much that our
model does not account for: borrowing is known to
be asymmetric; contact is temporal; borrowing must
obey universal implications. Despite the failure of
our model to account for these issues, however, it
appears largely successful. Moreover, like any "data
mining" expedition, our model suggests new linguistic
areas (particularly in the "whole world" experiments)
that deserve consideration.